Capstone Project


RSNA Pneumonia Detection Challenge on DICOM Images

Final report by Group 4 (AIML Online May 2021)

Team : TECH PHANTOM

Members :

Under the supervision of Sarvanan Devendran

GitHub Link

Introduction

To carry out an effective study on pneumonia detection, one must understand the disease, what causes it, and the technical jargon used in this field of study. In this section, we briefly cover these topics.

1. What is Pneumonia?

Pneumonia is an infection that inflames the air sacs in one or both lungs. The air sacs may fill with fluid or pus (purulent material), causing cough with phlegm or pus, fever, chills, and difficulty breathing. A variety of organisms, including bacteria, viruses and fungi, can cause pneumonia.

picture

2. Symptoms of Pneumonia.

The signs and symptoms of pneumonia vary from mild to severe, depending on factors such as the type of germ causing the infection, and your age and overall health. Mild signs and symptoms often are similar to those of a cold or flu, but they last longer.

Signs and symptoms of pneumonia may include:

3. How is Pneumonia detected?

In our study, we will focus our research on how pneumonia is detected with the help of chest X-rays. In the image below, we can see how pneumonia is read by doctors or radiologists in X-rays.

4. Opacity:

"Opacity refers to any area that preferentially attenuates the x-ray beam and therefore appears more opaque than the surrounding area. It is a nonspecific term that does not indicate the size or pathologic nature of the abnormality."
Any area in the chest radiograph that is whiter than it should be. If you compare the images of Sample Patient 1 and Sample Patient 2, you can see that the lower boundary of the lungs of patient 2 is obscured by opacities. In the image of Sample Patient 1 you can see a clear difference between the black lungs and the tissue below them, whereas in the image of Sample Patient 2 there is just fuzziness.

Summary of Problem statement, data and findings

Problem Statement:

'Pneumonia' is an infection of the lung whose diagnosis requires review of a chest radiograph by highly trained specialists. Pneumonia shows up in a chest radiograph as an area of opacity; however, diagnosing it can be complicated, and specialists spend much time and effort reviewing X-rays. The chest radiograph is the most commonly performed diagnostic imaging study, and due to its high volume it is very time consuming and labour intensive for radiologists to review each image manually.

As such, an automated solution that locates the position of inflammation in an image is desirable. Such an automated pneumonia screening system can assist physicians in making better clinical decisions, or even substitute for human judgement in this area.

About the Dataset:

The link for the dataset and more information can be found here - https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

The data is given in a zip file “rsna-pneumonia-detection-challenge.zip”, which contains the following items:

  1. stage_2_train_images:
    This folder contains the training dataset as chest radiograph DICOM images. Lung tissue full of air does not absorb x-rays and appears black; dense tissue absorbs x-rays and appears white.
  2. stage_2_train_labels.csv:
    This file maps each patientId to an image in the folder 'stage_2_train_images' and contains the bounding boxes of the areas of pneumonia detected in each image, along with a target label of 0 or 1 indicating whether pneumonia was detected.
    The stage_2_train_labels dataset has 30,227 rows and 6 columns.

  3. stage_2_detailed_class_info.csv:
    This file contains the corresponding patientId along with the target class labels of the images.
    The stage_2_detailed_class file covers the 26,684 training images present.

  4. stage_2_test_images:
    This folder contains the test dataset chest radiograph DICOM images.

  5. stage_2_sample_submission.csv:
    This file maps each patientId to an image in the folder 'stage_2_test_images'.

Importing data and libraries

Stage_2_train_labels:

The train_labels dataset contains the patientId and the coordinates of the bounding boxes along with their width and height.
It also contains a binary classification column, Target, which indicates whether the sample image has traces of pneumonia or not.

The dataset contains 30,227 records for 26,684 unique patients, which suggests that some patients have more than one entry (more than one bounding box detecting pneumonia).

Out of 30,227 entries in the dataset:
No. of positive cases = 9,555 (~32%)
No. of negative cases = 20,672 (~68%)

To cross verify, we can count the number of bounding boxes with null values for x, y, width and height.
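The cross-check above can be sketched in pandas. This is a minimal illustration on a made-up frame with the same columns as stage_2_train_labels.csv (the values are toy data, not the real records): rows with a null `x` should line up exactly with `Target == 0`.

```python
import pandas as pd
import numpy as np

# Toy stand-in for stage_2_train_labels.csv (columns match the real file;
# the values here are made up for illustration).
train_labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p3", "p4"],
    "x":      [264.0, np.nan, 562.0, np.nan],
    "y":      [152.0, np.nan, 152.0, np.nan],
    "width":  [213.0, np.nan, 256.0, np.nan],
    "height": [379.0, np.nan, 453.0, np.nan],
    "Target": [1, 0, 1, 0],
})

# A row has a bounding box iff x, y, width and height are all non-null,
# so the null count should equal the number of negative (Target == 0) rows.
n_null_boxes = train_labels["x"].isnull().sum()
n_negative = (train_labels["Target"] == 0).sum()
print(n_null_boxes, n_negative)  # the two counts should agree
```

On the real file the same check yields 20,672 null boxes, matching the negative-case count reported above.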

Thus, there are 23,286 unique patients with only one entry in the dataset (a single bounding box, or none), 3,266 patients with 2 bounding boxes, 119 with 3, and 13 with 4 bounding box coordinates.

Stage_2_train_class:

The data is divided into 3 labels.

  1. Normal.
  2. Lung opacity.
  3. Not normal/ no lung opacity.

Some information about the data field present in the 'stage_2_detailed_class_info.csv' are:

patientId - A patientId; each patientId corresponds to a unique image.
class - Has three values depending on the current state of the patient's lung: 'No Lung Opacity / Not Normal', 'Normal' and 'Lung Opacity'.

Thus, the dataset contains information about 26,684 patients (the same as the train_labels dataframe).

Merged dataset

Target and Class

The classes in the data are not balanced.

Thus, Target = 1 is associated with only class = Lung Opacity whereas Target = 0 is associated with only class = No Lung Opacity / Not Normal as well as Normal.
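The Target-to-class relationship can be verified with a cross-tabulation. Below is a minimal sketch on a toy merged frame (illustrative rows, not the real counts): Target = 1 should pair only with 'Lung Opacity'.

```python
import pandas as pd

# Toy merged frame with the Target/class relationship described above
# (the rows are illustrative, not the real dataset counts).
merged = pd.DataFrame({
    "Target": [1, 1, 0, 0, 0, 0],
    "class": ["Lung Opacity", "Lung Opacity",
              "Normal", "Normal",
              "No Lung Opacity / Not Normal", "No Lung Opacity / Not Normal"],
})

# Cross-tabulate Target against class; off-diagonal cells such as
# (Target = 0, 'Lung Opacity') should be zero.
table = pd.crosstab(merged["Target"], merged["class"])
print(table)
```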

Bound Box Distribution

We can see that the centres of the bounding boxes (red dots) are spread evenly across the lungs. A large portion of the bounding boxes have their centres near the centre of the lung, but some box centres are also located at the edges of the lungs (yellow portion).

stage_2_train_images:

The images used in radiology are stored in DICOM format.

A DICOM file consists of a header and image data sets packed into a single file. The information within the header is organized as a constant and standardized series of tags. By extracting data from these tags one can access important information regarding the patient demographics, study parameters, etc. Below, we can see the details stored in the dicom files along with the images.

Thus, the training images folder contains exactly 26,684 images, the same as the number of unique patientIds present in either of the csv files. We can therefore say that each unique patientId corresponds to one image in the folder.

Merged dataset 2

Merged dataset with DICOM file data:

The data from the DICOM files is now read and merged into the dataset.
Fields like 'Modality', 'PatientAge', 'PatientSex', 'BodyPartExamined', etc. have been added to the dataset for further processing.

Going forward we will now use this pickle file as our training data.
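Caching the merged frame as a pickle can be sketched as below. The frame and file name here are illustrative stand-ins; the point is that the slow DICOM parsing and merging only has to happen once.

```python
import os
import tempfile
import pandas as pd

# Illustrative merged frame (the real one has 18 columns, including the
# DICOM fields added above).
df = pd.DataFrame({"patientId": ["p1", "p2"],
                   "Target": [1, 0],
                   "PatientAge": [45, 60]})

# Persist once so later sessions can skip the DICOM parsing step.
path = os.path.join(tempfile.mkdtemp(), "train_merged.pkl")
df.to_pickle(path)
reloaded = pd.read_pickle(path)
print(reloaded.equals(df))  # the round trip preserves the frame
```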

Exploratory Data Analysis

In this section, we'll be analysing the final dataset to summarize its main characteristics, often using statistical graphics and other data visualization methods.

We have the final dataset as below, with 30227 rows and 18 columns.

Modality

Body Part Examined

Understanding Different Positions:

ViewPosition
AP : 15,297, which is 50.60% of the records in the dataset
PA : 14,930, which is 49.39% of the records in the dataset
As seen above, the two view positions in the training dataset are AP (Anterior/Posterior) and PA (Posterior/Anterior).
These types of X-rays are mostly used to obtain the front view; apart from the front view, a lateral image is usually taken to complement it.

We now look at the distribution of bounding boxes for each ViewPosition (AP and PA).
There are multiple outliers in both cases.

Conversion Type

Rows and Columns

Patient Sex

Thus, out of 30227 records, there are 17216 records of M (Male) and 13011 records of F (Female).

Thus, we can see that the number of Male patients suffering from Pneumonia is greater when compared with that of Females.

Patient Age

  1. Less than 20 years
  2. Between 20 and 35 years
  3. Between 35 and 50 years
  4. Between 50 and 65 years
  5. Greater than 65 years

Plotting DICOM Images: To view the Training images, the DICOM image data is plotted.

As the above subplots are of the images which belong to either "Normal" or "No Lung Opacity / Not Normal", hence no bounding box is observed.

In the above subplots, we can see that the area covered by the box (in red colour) depicts the area of interest i.e., the area in which the opacity is observed in the Lungs.
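A plot like the subplots above can be sketched as follows. The pixel array and box coordinates here are stand-ins (the notebook reads the real array with pydicom's `dcmread(...).pixel_array`); the red rectangle marks the area of opacity.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Stand-in for a 1024x1024 DICOM pixel array; box coords are illustrative.
img = np.random.randint(0, 255, (1024, 1024), dtype=np.uint8)
x, y, w, h = 264, 152, 213, 379

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(img, cmap="bone")
# Draw the pneumonia region as a red rectangle, as in the report's subplots.
ax.add_patch(patches.Rectangle((x, y), w, h,
                               linewidth=2, edgecolor="red", facecolor="none"))
ax.axis("off")
fig.savefig("sample_bbox.png")
plt.close(fig)
```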

Overview of the final process

EDA

Model Building and Pre-Processing:

Step-by-step walk through of the solution

Model Building

In this section, we'll be building various models for Pneumonia detection.

Data pre-processing is the first step of any AI/ML problem, and it is very important for the predictive power of any model. Here the images are in DICOM format, so we first need to read them into arrays for training. Secondly, the images come in 1024x1024 resolution, so we downsize them to 128x128; we use this size because we will be using a transfer-learning model, MobileNet, pretrained on the ImageNet dataset. Since this problem is not only about predicting classes (a classification problem) but also about predicting the coordinates of bounding boxes, we use the UNET architecture. For this we created a user-defined function that returns the training image array (X_train) with the corresponding masks and target variable.
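The downsizing step can be sketched with a pure-NumPy block mean; this is a stand-in for the resize call used in the notebook (which operates on pixel arrays read via pydicom), not the exact implementation.

```python
import numpy as np

def downsize(img: np.ndarray, size: int = 128) -> np.ndarray:
    """Downsize a square image by block-averaging.

    Stands in for the resize used in the notebook: 1024x1024 DICOM
    pixel arrays shrunk to 128x128 before feeding MobileNet/UNET.
    """
    n = img.shape[0]
    f = n // size  # shrink factor, e.g. 1024 // 128 = 8
    # Group pixels into f x f blocks and average each block.
    return img.reshape(size, f, size, f).mean(axis=(1, 3))

# A fake 1024x1024 'radiograph' for illustration:
img = np.arange(1024 * 1024, dtype=np.float32).reshape(1024, 1024)
small = downsize(img)
print(small.shape)  # (128, 128)
```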

Since we observed that the class distribution is imbalanced, we wrote a script that first separates the images of the two classes into separate variables, then combines the full set of images with target label 1 with a randomly picked subset of the images with label 0 (the subset being the same length as the full set for target 1). For instance, if we have 10 images with target label 1 and 100 images with target label 0, the script takes all 10 images with label 1 and 10 randomly chosen images (without replacement) from the 100 images with label 0, combining them into a balanced set of 20 images. This way we obtain a 1:1 class distribution.
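The undersampling described above can be sketched as follows (array shapes and the random seed are illustrative, not the project's actual script):

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake feature arrays standing in for the preprocessed 128x128 images.
pos = rng.normal(size=(10, 128, 128))    # target label 1 (Lung Opacity)
neg = rng.normal(size=(100, 128, 128))   # target label 0

# Undersample the majority class: keep every positive image and draw an
# equally sized subset of negatives without replacement.
idx = rng.choice(len(neg), size=len(pos), replace=False)
X = np.concatenate([pos, neg[idx]])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(idx))])
print(X.shape, y.mean())  # 20 images, 50% positives: a 1:1 distribution
```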

Before training the model, we divide the whole set into training and evaluation sets. For the classification model we took the first 5,000 images as the training set and kept the remaining images to evaluate performance on unseen data. For predicting bounding boxes with the UNET architecture we use a different split: if a person is normal, no boxes need to be predicted for that image, so we feed the UNET only images with some abnormality present and highlight the affected areas via bounding boxes. Accordingly, for UNET we created another set containing only images with target 1 (Lung Opacity) and predicted masks, and bounding boxes from them. The first 10 images of this set form the evaluation set and the remaining images train the UNET model.
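The two splits can be sketched as below; the array sizes are illustrative stand-ins, but the slicing mirrors the scheme described above (first 5,000 images for classifier training; only target = 1 images for UNET, with the first 10 held out).

```python
import numpy as np

# Stand-ins for the preprocessed arrays (sizes chosen for illustration).
X = np.zeros((6000, 128, 128))
y = np.zeros(6000)
y[:2000] = 1  # pretend the first 2000 images show lung opacity

# Classification split: first 5000 images train, the rest evaluate.
X_train_cls, y_train_cls = X[:5000], y[:5000]
X_eval_cls, y_eval_cls = X[5000:], y[5000:]

# UNET split: only abnormal (target == 1) images, first 10 held out
# for evaluation, the remainder used for training.
X_pos = X[y == 1]
X_eval_seg, X_train_seg = X_pos[:10], X_pos[10:]
print(X_train_cls.shape[0], X_eval_cls.shape[0], X_train_seg.shape[0])
```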

Highlights of Base Classifier Model:

Custom functions for modelling

Basic Sequential CNN Model

Highlights of Transfer Learning Model:

MobileNet Classifier Model

InceptionV3 Classifier Model

Model Evaluation

MobileNet Unet Model

Resnet34 Unet Model

InceptionV3 Unet Model

Model Evaluation

Save and Load the best models

Comparison to benchmark

Visualizations

Predicting images from Test folder

Gradio UI to display test predictions in web interface

Screenshot of the Gradio web interface

Pneumonia

Test Image 1

Non Pneumonia

Test Image 2

Implications

Returning to the main aim of this project, which is to build an automated pneumonia detection system that locates the position of inflammation in an image, we would say the presented model achieves this to a certain extent, but it can be improved further. Radiologists and technical experts must read and analyse high volumes of reports and present their findings, so it is quite possible that they miss important information or record incorrect data in their findings, which could be critical for the patient.

The current model can serve as the first step of screening the DICOM images of patients. After the model has categorised the images and predicted bounding boxes, the healthcare specialist can prioritise patients whose predicted category is pneumonia and whose bounding boxes are large. The precise area must still be investigated by the specialists, as the predicted area might be too large or not accurate enough; some false positives will also be generated, which a specialist can confirm. Images predicted with very small bounding boxes can be reviewed with lower priority, as they are deemed to have a higher probability of not showing pneumonia.

Limitations

Closing Reflections